
Rntbd health check improvement 2 #33464

Merged
xinlian12 merged 14 commits into Azure:main on Feb 16, 2023

Conversation

xinlian12
Member

@xinlian12 xinlian12 commented Feb 13, 2023

Continued effort to improve the Rntbd health check flow, especially for timeout detection.

Why the changes are needed
Based on a few recent latency investigations, several patterns have been identified that demand more aggressive connection closure.

  • ExpressRoute: there are identified cases where, when an ExpressRoute VM is redeployed (restarted, etc.), the existing connections are not closed, leaving every connection to that ExpressRoute VM in a broken state. Establishing new connections is the only way to recover.
  • Write workloads: a write operation is only sent to the primary replica, so even during a retry there is a higher chance that the request will be retried on the same connection.
  • Requests that time out on a connection that has been idle for a while.

Changes included in this PR:

Introduced a few more internal parameters to control how quickly a channel is closed when a transit timeout is detected:

timeoutDetectionEnabled: Default true
timeoutDetectionDisableCPUThreshold: Default 90.0
timeoutDetectionTimeLimit: Default 60s
timeoutDetectionHighFrequencyThreshold: Default 3
timeoutDetectionHighFrequencyTimeLimit: Default 10s
timeoutDetectionOnWriteThreshold: Default 1
timeoutDetectionOnWriteTimeLimit: Default 6s

A few timeout scenarios will trigger a connection to be closed (a simplified decision sketch follows the list):

  • timeoutDetectionTimeLimit: a timeout has been observed within this time limit; it does not matter how many timeouts have been observed. This helps detect a broken connection under a sparse workload.
  • timeoutDetectionHighFrequencyThreshold + timeoutDetectionHighFrequencyTimeLimit: timeouts have happened very frequently; in this case we want to close the channel sooner.
  • timeoutDetectionOnWriteThreshold + timeoutDetectionOnWriteTimeLimit: timeouts happened on write-related operations. Since only the primary replica is used for writes, we want to close the channel more aggressively as well.
  • timeoutDetectionDisableCPUThreshold: high CPU can cause a large number of request timeouts; when this happens, closing existing channels and re-establishing new ones will not help the situation but rather make it worse. When the CPU threshold is hit, timeout detection is disabled, and it automatically resumes once CPU usage drops back below the configured threshold.
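
The actual checks live in the SDK-internal Rntbd channel health checker; the following is a hypothetical, simplified sketch of how these thresholds could combine. All names here (shouldCloseChannel, firstTimeoutSinceLastSuccessfulRead, etc.) are illustrative, not SDK API:

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical sketch only; the real implementation may differ in detail.
    final class TimeoutDetectionSketch {
        // Defaults from this PR.
        static final Duration TIME_LIMIT = Duration.ofSeconds(60);                // timeoutDetectionTimeLimit
        static final int HIGH_FREQUENCY_THRESHOLD = 3;                            // timeoutDetectionHighFrequencyThreshold
        static final Duration HIGH_FREQUENCY_TIME_LIMIT = Duration.ofSeconds(10); // timeoutDetectionHighFrequencyTimeLimit
        static final int ON_WRITE_THRESHOLD = 1;                                  // timeoutDetectionOnWriteThreshold
        static final Duration ON_WRITE_TIME_LIMIT = Duration.ofSeconds(6);        // timeoutDetectionOnWriteTimeLimit
        static final double DISABLE_CPU_THRESHOLD = 90.0;                         // timeoutDetectionDisableCPUThreshold

        static boolean shouldCloseChannel(
            Instant now,
            Instant firstTimeoutSinceLastSuccessfulRead, // reset whenever a read succeeds
            int timeoutCount,                            // transit timeouts since the last successful read
            int timeoutOnWriteCount,                     // of those, how many were write operations
            double cpuUsagePercent) {

            // High CPU: the timeouts are likely CPU-induced, so closing channels would
            // only make things worse; detection is suspended until CPU drops again.
            if (cpuUsagePercent >= DISABLE_CPU_THRESHOLD || timeoutCount == 0) {
                return false;
            }

            Duration elapsed = Duration.between(firstTimeoutSinceLastSuccessfulRead, now);

            // 1. Sparse workload: any timeout persisting past the overall time limit.
            if (elapsed.compareTo(TIME_LIMIT) >= 0) {
                return true;
            }
            // 2. High-frequency timeouts within a shorter window.
            if (timeoutCount >= HIGH_FREQUENCY_THRESHOLD
                && elapsed.compareTo(HIGH_FREQUENCY_TIME_LIMIT) >= 0) {
                return true;
            }
            // 3. Write timeouts: writes only go to the primary replica, so close sooner.
            return timeoutOnWriteCount >= ON_WRITE_THRESHOLD
                && elapsed.compareTo(ON_WRITE_TIME_LIMIT) >= 0;
        }
    }
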
The above configs can be modified through system properties:
            System.setProperty("COSMOS.TCP_HEALTH_CHECK_TIMEOUT_DETECTION_ENABLED", "false");
 System.setProperty(
                "azure.cosmos.directTcp.defaultOptions",
                "{"\"timeoutDetectionTimeLimit\":\"PT61S\", \"timeoutDetectionHighFrequencyThreshold\":\"4\", " +
                    "\"timeoutDetectionHighFrequencyTimeLimit\":\"PT11S\", \"timeoutDetectionOnWriteThreshold\":\"2\"," +
                    "\"timeoutDetectionOnWriteTimeLimit\":\"PT7S\"}");

NetworkRequestTimeout: changing the minimum allowed value from 5s to 1s.
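
For context, the network request timeout is configured through the public DirectConnectionConfig API; a minimal sketch of opting into the new lower bound (endpoint and key are placeholders):

    import com.azure.cosmos.CosmosClient;
    import com.azure.cosmos.CosmosClientBuilder;
    import com.azure.cosmos.DirectConnectionConfig;
    import java.time.Duration;

    // Values below 5 seconds were previously rejected; this PR lowers the floor to 1 second.
    DirectConnectionConfig directConfig = DirectConnectionConfig.getDefaultConfig()
        .setNetworkRequestTimeout(Duration.ofSeconds(1));

    CosmosClient client = new CosmosClientBuilder()
        .endpoint("<account-endpoint>") // placeholder
        .key("<account-key>")           // placeholder
        .directMode(directConfig)
        .buildClient();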

Adding channel statistics in CosmosDiagnostics:

  • New connection: [screenshot of channel statistics in the diagnostics output]
  • Reusing an existing connection: [screenshot]
  • Timeout happened on the channel: [screenshot]
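
The statistics ride along in the existing diagnostics output; a minimal sketch of surfacing them, assuming an existing CosmosContainer and a hypothetical MyItem POJO:

    import com.azure.cosmos.CosmosContainer;
    import com.azure.cosmos.models.CosmosItemResponse;
    import com.azure.cosmos.models.PartitionKey;

    CosmosItemResponse<MyItem> response = container.readItem(
        "item-id", new PartitionKey("pk-value"), MyItem.class);

    // CosmosDiagnostics#toString() contains the transport-level details, including
    // the per-channel statistics added by this PR.
    System.out.println(response.getDiagnostics());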

@azure-sdk
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos

@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment


Thanks Annie - Kudos! Very clear design and implementation.

@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member Author

Failed tests:
openConnectionsAndInitCachesWithInvalidCosmosClientConfig - not caused by the change in this PR; will fix in a separate PR
before_ParallelDocumentQueryTest - tested locally and succeeded

@xinlian12
Member Author

/check-enforcer override

@xinlian12 xinlian12 merged commit cd4e903 into Azure:main Feb 16, 2023